Intro to the tidyverse

Lennart Kasserra

CorrelAid

2024-04-22

Contents

  • dplyr-verbs
  • “Piping”

dplyr

When working with data, you have to:

  1. Figure out what to do
  2. Describe those tasks in the form of a computer program
  3. Run the program

dplyr provides a set of simple “verbs” that correspond to the most common data manipulation tasks (like select(), filter(), summarise(), mutate() & arrange()) and a coherent “grammar” to make it easy to translate your thoughts into code.

library(dplyr)

Data

Data comes from the Bigfoot Field Researchers Organization (BFRO) & contains sightings with covariates & an accuracy rating by the BFRO.

bigfoot <- readr::read_csv(here::here("data/bigfoot.csv"))

Select

Pick or drop columns based on their name

select(.data = bigfoot, number, date, season, state, classification)

# Or:
bigfoot |> select(number, date, season, state, classification)

Wait… what is |>???

Piping

The “pipe” (|> or %>%) takes what is on the left-hand side, and hands it to the function on the right-hand side.

  • Read the pipe in your head as and then…
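
As a minimal sketch of what the pipe does, the snippet below uses only base R and a made-up vector called values (not part of the bigfoot data). It assumes R 4.1 or newer, where the native pipe |> is available. Writing x |> f() is another way of writing f(x), so the piped and the nested version should give the same result:

```r
# Made-up example vector (an assumption for illustration only)
values <- c(4, 9, 16)

# Piped: take values, and then take the square roots, and then sum them up
piped <- values |> sqrt() |> sum()

# Nested: the same computation, read from the inside out
nested <- sum(sqrt(values))

# Both spellings describe the same function calls
identical(piped, nested)
```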

Take bigfoot, and then select these columns:

bigfoot |> select(date, season, state)

Advanced select

# Dropping columns (negative selection):
bigfoot |> select(-longitude)
bigfoot |> select(-c(longitude, latitude))

# Selecting columns by name patterns:
bigfoot |> select(starts_with("temperature"))
bigfoot |> select(ends_with("tude"))
bigfoot |> select(contains("_"))

# Select a range of adjacent columns (`from:to`)
bigfoot |> select(observed:season)

# Select based on condition:
bigfoot |> select(where(is.numeric))

# Select to reorder:
bigfoot |> select(number, county, state, date, everything())
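
To try the selection helpers without the full dataset, here is a small sketch on a hypothetical table called toy. Its column names only mimic those used above and its values are invented. Piping into names() shows which columns each selection keeps:

```r
library(dplyr)

# Hypothetical miniature table; the values are invented for illustration
toy <- tibble(
  number = 1:2,
  state = c("Ohio", "Texas"),
  latitude = c(40.1, 31.2),
  longitude = c(-82.9, -97.5),
  temperature_high = c(70, 95),
  temperature_low = c(50, 75)
)

# Columns whose names end in "tude"
by_suffix <- toy |> select(ends_with("tude")) |> names()

# Everything except the two coordinate columns
dropped <- toy |> select(-c(longitude, latitude)) |> names()

# Only the character columns
by_type <- toy |> select(where(is.character)) |> names()

# state first, then all remaining columns in their original order
reordered <- toy |> select(state, everything()) |> names()
```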

Filter

Keep rows that match a condition

  • Equals (==), not equal (!=), not (!), or (|), and (&); a comma between conditions is also treated as &
  • Check for missing values with is.na()
  • between() & near()

bigfoot |> filter(state == "Alabama")
bigfoot |> filter(date >= "2020-01-01")
bigfoot |> filter(temperature_mid <= 32)
bigfoot |> filter(!is.na(season), state == "Wisconsin")
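
between() & near() are listed above but not demonstrated, so here is a minimal sketch on a hypothetical table called toy (invented values, column names borrowed from bigfoot). It also illustrates that filter() drops rows for which the condition evaluates to NA:

```r
library(dplyr)

# Hypothetical data; one state is deliberately missing
toy <- tibble(
  state = c("Ohio", "Texas", "Ohio", NA),
  temperature_mid = c(31.5, 70, 45, 60)
)

# between() is inclusive on both ends, i.e. the same as
# temperature_mid >= 40 & temperature_mid <= 65
mild <- toy |> filter(between(temperature_mid, 40, 65))

# The comparison NA == "Ohio" is NA, so that row is dropped as well
ohio <- toy |> filter(state == "Ohio")

# near() compares floating point numbers with a small tolerance
exact_check <- sqrt(2)^2 == 2      # FALSE because of floating point error
near_check <- near(sqrt(2)^2, 2)   # TRUE
```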

Laying down pipe

Piping also allows us to combine steps:

bigfoot |> 
  filter(!is.na(season), state == "Wisconsin") |> 
  select(county, season, latitude, longitude)

Take bigfoot, and then filter on the given condition, and then select the given columns.

Theoretically, you could also nest function calls:

select(
  filter(.data = bigfoot, !is.na(season), state == "Wisconsin"), 
  county, season, latitude, longitude
)

But the piped version is usually preferable & more readable: it can be read sequentially, while the nested call has to be read from the inside out…

Exercise time

Subset bigfoot so that:

  • We only have sightings during summer
  • We retain the columns state, temperature_mid & the two coordinate columns (longitude & latitude).

Assignment

Assign the resulting object to a name to keep it:

wisconsin <- 
  bigfoot |> 
  filter(!is.na(season), state == "Wisconsin") |> 
  select(county, season, latitude, longitude)

Theoretically, R also supports right-hand assignment:

bigfoot |> 
  filter(!is.na(season), state == "Wisconsin") |> 
  select(county, season, latitude, longitude) -> wisconsin

…but this is generally considered more of a gimmick & frowned upon (makes it harder to tell when objects are being created).

Mutate

Create or modify columns:

bigfoot |> mutate(temp_celsius = (temperature_mid - 32) / 1.8)

# New columns can be any function of existing ones:
bigfoot |> 
  mutate(
    sub_zero = if_else(temperature_mid < 32, 1, 0),
    year = lubridate::year(date)
  )
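
As a quick sanity check of the conversion formula, here is a sketch on a hypothetical one-column table (the three temperatures are chosen for illustration: freezing point, a mild day & boiling point). It also shows that a column created in a mutate() call can be used by later expressions in the same call; the column name freezing_or_below is made up:

```r
library(dplyr)

# Hypothetical temperatures in Fahrenheit
toy <- tibble(temperature_mid = c(32, 50, 212))

toy_celsius <- toy |>
  mutate(
    temp_celsius = (temperature_mid - 32) / 1.8,
    # temp_celsius was created one line above and is already available here
    freezing_or_below = temp_celsius <= 0
  )

# Roughly 0, 10 & 100 degrees Celsius (up to floating point error)
toy_celsius$temp_celsius
```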

Advanced mutate

Using the across()-helper inside mutate, we can apply the same transformation to multiple columns:

fahr_to_celsius <- function(temp) {
  (temp - 32) / 1.8
}

bigfoot |> mutate(across(starts_with("temperature"), fahr_to_celsius))

More concisely, we could just embed the function anonymously:

bigfoot |> 
  mutate(across(starts_with("temperature"), \(temp) (temp - 32) / 1.8))

  • \(temp) is a shorthand for function(temp)

Advanced mutate pt. 2

If you want to keep the original columns, use the .names-argument:

bigfoot <- 
  bigfoot |> 
  mutate(
    across(starts_with("temperature"), function(temp) (temp - 32) / 1.8, .names = "{col}_celsius")
  )
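
To see what .names does to the column names, here is a sketch on a hypothetical table called toy (invented values). The original columns should stay untouched, with the converted copies appended under the new names:

```r
library(dplyr)

# Hypothetical data with two temperature columns in Fahrenheit
toy <- tibble(
  state = c("Ohio", "Texas"),
  temperature_high = c(68, 95),
  temperature_low = c(50, 77)
)

converted <- toy |>
  mutate(
    across(
      starts_with("temperature"),
      \(temp) (temp - 32) / 1.8,
      # {col} is replaced by the name of the column being transformed
      .names = "{col}_celsius"
    )
  )

names(converted)
```

One thing to watch out for when the result is assigned back to the same name: running the step a second time would also match the new *_celsius columns, since their names start with "temperature" as well.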

Summarise

Compute summaries:

bigfoot |> 
  summarise(mean_temp = mean(temperature_mid, na.rm = TRUE))

Group by & summarise

  • group_by(): Aggregate or compute summaries by group (here: state):

bigfoot |> 
  group_by(state) |> 
  summarise(mean_temp = mean(temperature_mid, na.rm = TRUE))

Don’t forget to ungroup() your data (or set .groups = "drop" in summarise()) if you don’t want later computations to be done by group.

You can also group by multiple columns. Here: count() the number of observations by state & season:

bigfoot |> 
  group_by(state, season) |> 
  count()
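
The reminder about ungrouping matters most when grouping by several columns. The sketch below uses a hypothetical table called toy (invented values) and group_vars() to inspect which grouping is left after summarise(). The value "drop_last" is spelled out here only to make the default behaviour visible:

```r
library(dplyr)

# Hypothetical data with two grouping columns
toy <- tibble(
  state = c("Ohio", "Ohio", "Texas", "Texas"),
  season = c("Summer", "Winter", "Summer", "Summer"),
  temperature_mid = c(75, 30, 90, 94)
)

# "drop_last" peels off only the last grouping column (season)
still_grouped <- toy |>
  group_by(state, season) |>
  summarise(mean_temp = mean(temperature_mid), .groups = "drop_last")

# "drop" removes all grouping from the result
ungrouped <- toy |>
  group_by(state, season) |>
  summarise(mean_temp = mean(temperature_mid), .groups = "drop")

group_vars(still_grouped)  # "state"
group_vars(ungrouped)      # character(0)
```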

Advanced group by & summarise

Protip: across() also works inside summarise:

bigfoot |> 
  group_by(state) |> 
  summarise(across(ends_with("celsius"), \(x) mean(x, na.rm = TRUE)))

Arranging

  • arrange(): Order rows (observations) using values of a column:

bigfoot |> arrange(date)
bigfoot |> arrange(desc(temperature_mid))

Renaming

  • rename(): Self-explanatory… The pattern is data |> rename(new_name = old_name)

bigfoot |> rename(report = observed, loc_description = location_details)

Protip: you can also rename inside select() (think: select col as name):

bigfoot |> select(date, state, lat = latitude, lon = longitude)

Other useful mini-verbs

  • count(): Count the number of observations for each unique value:

bigfoot |> count(state)

bigfoot |> 
  count(state) |> 
  arrange(desc(n))

  • n(): Size of the current group (like count(), for use with group_by() & summarise()):

bigfoot |> 
  group_by(state) |> 
  summarise(obs = n(), mean_temp = mean(temperature_mid_celsius, na.rm = TRUE))

  • slice(): Subset rows by position:

bigfoot |> slice(1:10)

  • distinct(): Keep distinct/unique rows:

bigfoot |> distinct() # drops duplicate rows
bigfoot |> distinct(state) # shows the unique values of "state" (all states)
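
To make the relationship between count() & n() concrete, here is a small sketch on a hypothetical table called toy (invented values). Counting by state directly and counting via group_by() plus summarise(n = n()) should produce the same states and the same counts:

```r
library(dplyr)

# Hypothetical data: three sightings in Ohio, two in Texas
toy <- tibble(state = c("Ohio", "Texas", "Ohio", "Ohio", "Texas"))

# Shortcut: count() groups, counts & ungroups in one step
via_count <- toy |> count(state)

# Long form: group explicitly, then ask for the group size with n()
via_n <- toy |>
  group_by(state) |>
  summarise(n = n())

# Same groups & same counts in both results
identical(via_count$state, via_n$state) && identical(via_count$n, via_n$n)
```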

Exercise time

  • Which state had the most sightings? Which are the top 10 states?
  • Which state had the most “confirmed” sightings (classification == "Class A")?
  • Which state had the most sightings in winter? Which in summer?